对抗性训练已被证明是捍卫对抗性例子的最有效的补救措施之一,但通常会遭受在看不见的测试对手身上巨大的稳定性概括差距,被认为是\ emph {对抗性强大的概括性问题}。尽管最初是针对对抗性强大的概括的初步理解,但从建筑的角度来看,知之甚少。因此,本文试图通过系统地检查最具代表性的体系结构(例如,视觉变压器和CNN)来弥合差距。特别是,我们首先对Imagenette和CIFAR-10数据集进行了对抗训练的架构\ Emph {20}对几个对手(多个$ \ ell_p $ -norm -norm对照攻击)的架构,并发现视觉变形金刚(例如,PVT,Coatnet)经常产生更好的对抗性稳定性。为了进一步了解哪种建筑成分有利于对抗性的强大概括,我们深入研究了几个关键的构建块,并通过Rademacher复杂性的镜头揭示了这一事实,即更高的重量稀疏性对更好的对手的视觉变形金刚的强大良好概括有很大贡献,这通常可以实现这一目标,这是可以实现的。通过注意层。我们的广泛研究发现了建筑设计与对抗性稳定的概括之间的密切关系,并实例化了一些重要的见解。我们希望我们的发现可以帮助更好地理解设计强大的深度学习体系结构的机制。
translated by 谷歌翻译
知识图(kg)嵌入是一种主流方法,用于推理不完整的kg。但是,受其固有浅层和静态体系结构的限制,它们几乎无法处理对复杂逻辑查询的不断上升,这些查询包括逻辑运算符,估算的边缘,多个源实体和未知的中间实体。在这项工作中,我们通过掩盖的预训练和微调策略介绍了知识图变压器(kgtransformer)。我们设计了一种kg三重变换方法,以使变压器能够处理kg,这是通过稀疏(MOE)稀疏激活的混合物进一步增强的。然后,我们将复杂的逻辑查询作为掩盖预测提出,并引入了两阶段掩盖的预训练策略,以提高可转移性和概括性。在两个基准上进行的广泛实验表明,KGTRANSFORMER可以始终超过基于KG的基准和九个内域和室外推理任务的高级编码。此外,KGTRANSFORMER可以通过提供解释给定答案的完整推理路径来解释性。
translated by 谷歌翻译
该技术报告提出了一种有效的自动驾驶运动预测方法。我们开发了一种基于变压器的方法,用于输入编码和轨迹预测。此外,我们提出了时间流动头来增强轨迹编码。最后,使用了有效的K均值集合方法。使用我们的变压器网络和集合方法,我们以1.90的最新Brier-Minfde得分赢得了Argoverse 2 Motion预测挑战的第一名。
translated by 谷歌翻译
作为世界各地的Covid-19大流行横冲直撞,对视频会议激增的需求。为此,实时肖像分割成为一种流行的功能,以取代会议参与者的背景。虽然为从生命场景中提取身体姿势的分段提供了具有丰富的数据集,模型和算法,但纵向分割尚未在视频会议上下文中覆盖很好。为了促进该领域的进步,我们介绍了名为PP-Humanseg的开源解决方案。这项工作是第一个构建一个大型视频纵向数据集,其中包含291个会议场景中的291个视频,其中14K细微的帧和扩展到多摄像头电话。此外,我们提出了一种用于语义分割的新型语义连接感知学习(SCL),其引入了语义连接感知丢失,以提高来自连接的角度的分段结果。我们提出了一种超轻量级模型,具有SCL的实际肖像分割,实现IOO之间的最佳权衡和推理的速度。我们数据集的广泛评估展示了SCL和我们的模型的优越性。源代码可在https://github.com/paddlepaddle/paddleseg上获得。
translated by 谷歌翻译
As one of the most important psychic stress reactions, micro-expressions (MEs), are spontaneous and transient facial expressions that can reveal the genuine emotions of human beings. Thus, recognizing MEs (MER) automatically is becoming increasingly crucial in the field of affective computing, and provides essential technical support in lie detection, psychological analysis and other areas. However, the lack of abundant ME data seriously restricts the development of cutting-edge data-driven MER models. Despite the recent efforts of several spontaneous ME datasets to alleviate this problem, it is still a tiny amount of work. To solve the problem of ME data hunger, we construct a dynamic spontaneous ME dataset with the largest current ME data scale, called DFME (Dynamic Facial Micro-expressions), which includes 7,526 well-labeled ME videos induced by 671 participants and annotated by more than 20 annotators throughout three years. Afterwards, we adopt four classical spatiotemporal feature learning models on DFME to perform MER experiments to objectively verify the validity of DFME dataset. In addition, we explore different solutions to the class imbalance and key-frame sequence sampling problems in dynamic MER respectively on DFME, so as to provide a valuable reference for future research. The comprehensive experimental results show that our DFME dataset can facilitate the research of automatic MER, and provide a new benchmark for MER. DFME will be published via https://mea-lab-421.github.io.
translated by 谷歌翻译
Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Our fine-tuned Transformer baselines show promising results, with models performing well above random on most questions. However, on a large subset of questions, there is still room for significant improvement. As the only expert-annotated merger agreement dataset, MAUD is valuable as a benchmark for both the legal profession and the NLP community.
translated by 谷歌翻译
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of the datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs/tumors, which can not be extended to novel domains and classes. To tackle these limitations, we introduce embedding learned from Contrastive Language-Image Pre-training (CLIP) to segmentation models, dubbed the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve the state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting the previously learned classes.
translated by 谷歌翻译
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
translated by 谷歌翻译
Neural models with an encoder-decoder framework provide a feasible solution to Question Generation (QG). However, after analyzing the model vocabulary we find that current models (both RNN-based and pre-training based) have more than 23\% inflected forms. As a result, the encoder will generate separate embeddings for the inflected forms, leading to a waste of training data and parameters. Even worse, in decoding these models are vulnerable to irrelevant noise and they suffer from high computational costs. In this paper, we propose an approach to enhance the performance of QG by fusing word transformation. Firstly, we identify the inflected forms of words from the input of encoder, and replace them with the root words, letting the encoder pay more attention to the repetitive root words. Secondly, we propose to adapt QG as a combination of the following actions in the encode-decoder framework: generating a question word, copying a word from the source sequence or generating a word transformation type. Such extension can greatly decrease the size of predicted words in the decoder as well as noise. We apply our approach to a typical RNN-based model and \textsc{UniLM} to get the improved versions. We conduct extensive experiments on SQuAD and MS MARCO datasets. The experimental results show that the improved versions can significantly outperform the corresponding baselines in terms of BLEU, ROUGE-L and METEOR as well as time cost.
translated by 谷歌翻译
In this paper, we develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner. Unlike most existing methods with offline feature generation, our method directly takes frames as input and further models motion evolution on two different temporal scales.Therefore, we solve the complexity problems of the two stages of modeling and the problem of insufficient temporal and spatial information of a single scale. Our proposed End-to-End MultiScale Network (E2EMSNet) is composed of two scales which are named segment scale and observed global scale. The segment scale leverages temporal difference over consecutive frames for finer motion patterns by supplying 2D convolutions. For observed global scale, a Long Short-Term Memory (LSTM) is incorporated to capture motion features of observed frames. Our model provides a simple and efficient modeling framework with a small computational cost. Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101. The extensive experiments demonstrate the effectiveness of our method for action prediction in videos.
translated by 谷歌翻译